This report is organized as follows:

  1. Environment: informational, documenting the compute environment for peer review and reproducibility.
  2. Train Data: includes some exploratory data analysis on the training dataset.
  3. Trip Data: explores the structure of trip data, provides an efficient formula for calculating heading change, and proposes a rules-based logic for extracting turning and stopping features.
  4. Feature Extraction: here I implement the feature extraction logic in 3 key functions with some basic error handling. I also evaluate the feature extraction logic visually using a sample of trips, and propose some avenues for future methods/research. Finally, I evaluate the computational performance of this implementation.
  5. Modeling: The most substantial section, this includes the modeling workflow with sections for feature extraction, a simple imputation for missing data, model specification and hyperparameter tuning, and finally some post-modeling inference.
  6. Test predictions: informational, just forming predicted labels on the supplied test data.

1 Environment

##                               sysname                               release 
##                               "Linux"         "3.10.0-1160.24.1.el7.x86_64" 
##                               version                               machine 
## "#1 SMP Thu Mar 25 21:21:56 UTC 2021"                              "x86_64"
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## corrplot 0.84 loaded
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Loading required package: lattice
##  chr [1:1000] "0000.csv" "0001.csv" "0002.csv" "0003.csv" "0004.csv" ...
##  chr [1:876] "1000.csv" "1001.csv" "1002.csv" "1003.csv" "1004.csv" ...

2 Train data

We start by loading the training set and inspecting the contents.

## Rows: 1,000
## Columns: 16
## $ filename  <chr> "0000.csv", "0001.csv", "0002.csv", "0003.csv", "0004.csv",…
## $ feature1  <chr> "False", "False", "False", "False", "False", "False", "Fals…
## $ feature2  <chr> "False", "False", "False", "False", "False", "False", "Fals…
## $ feature3  <chr> "True", "False", "True", "True", "False", "False", "True", …
## $ feature4  <dbl> 5.209096, 4.450941, 5.396552, 4.970163, 5.266868, 5.423508,…
## $ feature5  <dbl> 9789.262, 10552.522, 10233.433, 10829.057, 10678.704, 8884.…
## $ feature6  <dbl> 30753.87, 33151.73, 32149.28, 34020.49, 33548.14, 27911.68,…
## $ feature7  <dbl> 0.0010104224, 0.0009997680, 0.0010145655, 0.0009929965, 0.0…
## $ feature8  <int> 5, 3, 6, 4, 6, 4, 6, 2, 7, 6, 5, 4, 6, 5, 7, 8, 7, 3, 8, 8,…
## $ feature9  <int> 13, 11, 13, 8, 11, 9, 11, 8, 15, 12, 5, 7, 9, 10, 8, 8, 12,…
## $ feature10 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ feature11 <dbl> 9.373984e+03, 4.251619e+04, 1.305321e+07, 1.131908e+03, 3.1…
## $ feature12 <dbl> 3.179204e-01, 2.229321e+00, 3.425951e+01, 2.576871e+01, 1.4…
## $ feature13 <dbl> 9.379193e+03, 4.252064e+04, 1.305322e+07, 1.136878e+03, 3.1…
## $ feature14 <dbl> 4.974085, 3.151531, 6.236594, 3.968008, 5.999782, 3.974651,…
## $ y         <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,…

We will convert feature1:feature3 to the factor class for modeling. We may also want y as a factor, depending on the libraries we use.
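A base-R sketch of the conversion, on a toy frame mirroring the glimpse above (the real data has 1,000 rows):

```r
## toy stand-in for the training frame
train <- data.frame(feature1 = c("False", "True"),
                    feature3 = c("True", "False"),
                    y = c(0L, 1L))

## convert the character flags to factors
fctr_cols <- c("feature1", "feature3")
train[fctr_cols] <- lapply(train[fctr_cols], factor)

## and a factor version of the response for libraries that require it
train$y_fctr <- factor(train$y, levels = c(0, 1),
                       labels = c("Class_0", "Class_1"))
```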

2.1 Correlation analysis

We plot the correlation matrix in order to understand linear relationships between the features.

Let’s also check the three binary factors:

## , , feature2 = False
## 
##         feature3
## feature1 False True
##    False   439  459
##    True     54   48

We make the following observations about the dataset:

  1. feature11 and feature13 are co-linear
  2. feature5 and feature6 are co-linear
  3. feature8 and feature14 are co-linear
  4. feature10 and feature2 are both zero-variance (all observations are “False” for feature2)
  5. the remaining features are near perfectly independent of each other

As a result, we will remove 5 features in total (feature2, feature10, and one from each co-linear pair) without any loss of information.
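A quick sketch of the pruning. Which member of each co-linear pair to keep is arbitrary; here we keep feature6, feature13, and feature14, matching the columns that appear in the distribution summary in the next section:

```r
## toy frame with all 14 feature columns
train <- setNames(data.frame(matrix(0, nrow = 2, ncol = 14)),
                  paste0("feature", 1:14))

## drop the zero-variance and redundant co-linear columns
drop_cols <- c("feature2", "feature10", "feature5", "feature8", "feature11")
train <- train[, !(names(train) %in% drop_cols)]

ncol(train)  ## 9 features remain
```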

2.2 Distribution analysis

Let’s also look at some distribution statistics for each feature. This may inform modeling decisions later on - method selection and feature transformations, for example.

##   feature1    feature3      feature4        feature6        feature7        
##  False:898   False:493   Min.   :2.116   Min.   :22599   Min.   :0.0009710  
##  True :102   True :507   1st Qu.:4.324   1st Qu.:29264   1st Qu.:0.0009934  
##                          Median :4.991   Median :31294   Median :0.0010007  
##                          Mean   :4.985   Mean   :31346   Mean   :0.0010004  
##                          3rd Qu.:5.675   3rd Qu.:33548   3rd Qu.:0.0010069  
##                          Max.   :8.819   Max.   :40275   Max.   :0.0010345  
##     feature9       feature12           feature13           feature14      
##  Min.   : 2.00   Min.   :      0.0   Min.   :4.000e+00   Min.   : 0.8254  
##  1st Qu.: 8.00   1st Qu.:      0.1   1st Qu.:8.010e+02   1st Qu.: 3.9806  
##  Median :10.00   Median :      2.7   Median :2.295e+04   Median : 4.9881  
##  Mean   :10.05   Mean   :   8845.5   Mean   :8.861e+08   Mean   : 4.9762  
##  3rd Qu.:12.00   3rd Qu.:     78.3   3rd Qu.:6.395e+05   3rd Qu.: 6.0227  
##  Max.   :22.00   Max.   :1572134.4   Max.   :6.070e+11   Max.   :10.3276  
##      y_fctr   
##  Class_0:690  
##  Class_1:310  
##               
##               
##               
## 

Most features are ‘well-behaved’. For example, feature4 is reasonably symmetric without extreme outliers:

feature12 and feature13 stand out as being heavily right-skewed. See for example the density for feature12:

We may consider transforming these features later (for example, taking logs) depending on the modeling approach (some algorithms being invariant to monotonic transforms).
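If we do transform, log1p is a safer choice than a plain log given the zero minimum observed for feature12 (log(0) is -Inf):

```r
## summary values observed for feature12 above (min through max)
x <- c(0, 0.1, 2.7, 78.3, 1572134.4)

## log1p(x) = log(1 + x): defined at zero, monotonic, compresses the right tail
log1p(x)
```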

We also note that our response variable is binary with imbalanced classes.

3 Trip data

We will extract new features from the trip data files. Let’s start by getting a feel for a single trip time series.

Trips record a time stamp for each reading, typically taken at a 1-second frequency (with possible exceptions at the start of a trip). Each reading also carries a speed (meters per second) and a heading (clockwise angle from magnetic north), although the nominal heading may matter less than changes in heading (turning).

Noticing there may be missing data at the beginning and end of a trip, we will impute missing values via NOCB (next obs. carried backward) at the start, and the reverse (LOCF) at the tail. This is not necessarily a strong approach for missing data in the middle of a series, though it has the virtue that it will not fail with error.
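A minimal base-R sketch of the two-sided fill. The prep_trip() function later uses imputeTS::na_locf(), which, with its default na_remaining = "rev", should behave the same way:

```r
## last observation carried forward
locf <- function(v) {
  for (i in seq_along(v)[-1]) if (is.na(v[i])) v[i] <- v[i - 1]
  v
}

x <- c(NA, NA, 3, NA, 5, NA)
x <- locf(x)            ## fills interior and trailing NAs forward
x <- rev(locf(rev(x)))  ## NOCB for the leading NAs
x  ## 3 3 3 3 5 5
```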

Let’s plot the trip series

The sharp spikes at the end of the trip stand out - they are due to the fact that with heading degrees, the difference between 359 and 0 is 1 degree, not 359.

3.1 Heading Change

We will use the following formula when measuring the degree of turn, denoted D:

\[D = (h_t - h_{t-1} + 540) \pmod{360} - 180\]

This optimized formulation avoids branching logic around the 0/360 wrap-around and references each input (the current and previous heading) just once. Let’s take first-order differences to get a feel for acceleration, and we will use the formula above to express the change in heading.
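A quick check of the formula at the wrap-around (the output range is [-180, 180), with positive values indicating clockwise turns):

```r
## degree of turn D between consecutive headings
heading_change <- function(h_now, h_prev) ((h_now - h_prev + 540) %% 360) - 180

heading_change(0, 359)   ## +1: one degree clockwise across north, not -359
heading_change(359, 0)   ## -1: one degree counter-clockwise across north
heading_change(100, 80)  ## +20: ordinary case, the simple difference
```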

3.2 Turning

Now we think about identifying turns. Our intuition tells us they should be characterized by:

  1. consecutive runs (given sampling) of substantial heading changes (first order differencing) of the same sign;
  2. the turn (run) should not be too long a duration, given the vehicle is moving (this may indicate road curvature rather than turning);
  3. the magnitude of the sum of heading differences within a run must exceed some threshold
    • stated another way, the difference between the ending and the starting heading should exceed some threshold
  4. they also should correspond with travel at some appreciable velocity (not idling in place - this could occur due to instrument/measurement error) but also below some threshold velocity (which could again indicate taking curves at high speed).

We will threshold a turn as having absolute heading magnitude change of 60 degrees or more. We will also exclude any instances where the average speed during the turn exceeded 20 m/s, or about 45 mph - the friction between road and tire is probably not sufficient to support turns at greater speed for most vehicles. Similarly, we will put an upper threshold on duration of the maneuver (45 seconds) so as not to misclassify long curves at lower speed.

For this trip we observe 9 turns as we have defined them.

4 Feature extraction

We now define several functions to prep data from a trip file and extract the features according to the logic above.

4.1 prep_trip()

The prep_trip() method will accept the filename input as a string, and the partition/sub-dir in which the file appears, defaulted to “train.” It returns a data.frame augmented with new columns, which is taken as input by the other methods below.

prep_trip <- function(trip_file, partition = "train") {
  
  trip <- read.csv(paste0(partition, "/", trip_file))
  n <- nrow(trip)
  
  if(n == 0 || ncol(trip) != 3 ||  ## no rows or wrong num of cols
     sum(is.na(trip[,2])) == n || sum(is.na(trip[,3])) == n || ## all missing data in either col
     var(trip[, 2], na.rm = TRUE) == 0 || var(trip[, 3], na.rm = TRUE) == 0) { ## zero-variance column
    
    bad_trip <<- trip  ## return bad trip data to parent environment for a look-see
    stop(paste0("bad trip here: ", trip_file,
                "\n   Speed NA: ", sum(is.na(trip[,2])),
                "\n   Heading NA: ", sum(is.na(trip[,3])),
                "\n   Speed variance: ", var(trip[, 2], na.rm = T),
                "\n   Heading variance: ", var(trip[, 3], na.rm = T) ))
  } ## early exit
  
  trip %>%
    imputeTS::na_locf() %>%
    ## turns
    mutate(heading_diff = (heading_degrees - lag(heading_degrees) + 540) %% 360 - 180,
           run = sign(heading_diff) * (abs(heading_diff) > .25)) %>%
    group_by(runid = {runid = rle(run); rep(seq_along(runid$lengths), runid$lengths)}) %>% 
    mutate(tot_heading_diff = (last(heading_degrees) - first(heading_degrees) + 540) %% 360 - 180,
           avg_speed = mean(speed_meters_per_second),
           turn_ind = n() > 2 &
             abs(tot_heading_diff) >= 60 &
             avg_speed < 20 &
             n() < 45) %>%
    ungroup() %>%
    mutate(turn_amt = case_when(turn_ind ~ tot_heading_diff, TRUE ~ NA_real_)) %>%
    ## stops
    mutate(accel = speed_meters_per_second - lag(speed_meters_per_second),
           roll_accel_15 = lag(zoo::rollapply(accel, 15, min, align = "right", fill = NA)),
           stop_run = abs(speed_meters_per_second) < .05) %>%
    group_by(stop_runid = {stop_runid = rle(stop_run); rep(seq_along(stop_runid$lengths), stop_runid$lengths)}) %>%
    mutate(stop_ind = n() > 2 & max(stop_run) & first(roll_accel_15) < -.5) %>%
    ungroup() %>%
    return()
}

4.2 count_stops()

This function takes as input the result from the prep_trip() method, and returns a length 1 vector with the count of stops in the trip.
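The source of count_stops() is not shown here; a minimal sketch consistent with the stop_runid and stop_ind columns that prep_trip() creates:

```r
## count distinct runs flagged as stops
count_stops <- function(trip) length(unique(trip$stop_runid[trip$stop_ind]))

## toy prepped trip: one three-reading stop among other runs
toy <- data.frame(stop_runid = c(1, 1, 2, 2, 2, 3),
                  stop_ind   = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE))
count_stops(toy)  ## 1
```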

4.3 count_turns()

This function takes as input the result from the prep_trip() method, and returns a length 1 vector with the count of turns in the trip.
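Analogously, a sketch for count_turns() using the runid and turn_ind columns from prep_trip():

```r
## count distinct runs flagged as turns
count_turns <- function(trip) length(unique(trip$runid[trip$turn_ind]))

## toy prepped trip with two flagged turn runs
toy <- data.frame(runid    = c(1, 2, 2, 3, 4, 4),
                  turn_ind = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE))
count_turns(toy)  ## 2
```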

4.5 Evaluate feature logic

I’ll take a sample of trips, 5 from each class, and apply the plotting function defined above to visually check for reasonable results.

The results seem reasonable, though not perfect. For example, in 0973.csv there appears to be a stop which is missed due to a noisy speed reading, which could be due to measurement error in the sensor or a driver who likes to play with the brake pedal at lights.

One avenue for improvement will be data smoothing - e.g. via smoothing splines, simple kernels/filters, kriging, etc. - as part of a pre-processing step.

Another avenue includes techniques based on the matrix profile - specifically, shapelet and motif discovery.1 These methods involve discovering subsequences within a series which are similar to each other and/or maximally separating of classes.

4.6 Computational Performance

For a single trip, we must first call prep_trip() which reads the relevant file from disk and returns a data.frame object. Then we call each of count_turns() and count_stops() which takes the data.frame as input and returns a length 1 vector for each. The computational time for 10 sequential iterations is shown below:

## [1] "Test performed for 10 files sequentially"
##    user  system elapsed 
##   0.527   0.013   0.541

For 10 trips, elapsed (wall-clock) time is about half a second, nearly all of it spent in user CPU time (parsing and feature computation); system time, which includes disk I/O, is a marginal fraction of the total.

The process will scale over multiple files in O(n) (linear) time.
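Since trips are independent, the per-file work also parallelizes trivially. A sketch with base R’s parallel package; process_one is a hypothetical stand-in for the prep_trip()/count_turns()/count_stops() pipeline (mclapply forks, so on Windows parLapply would be used instead):

```r
library(parallel)

## hypothetical stand-in for prep_trip() + count_turns() + count_stops()
process_one <- function(f) c(turn_count = 0L, stop_count = 0L)

files <- sprintf("%04d.csv", 0:9)  ## illustrative file list
res <- mclapply(files, process_one, mc.cores = 2L)
length(res)  ## one result per file
```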

5 Modeling

We turn now to the modeling analysis, right after we prepare our training data

5.1 Extract Training features

First let’s prepare our features for the training set

## prep_trip() failed for file 0195.csv with following error: 
## Error in prep_trip(f): bad trip here: 0195.csv
##    Speed NA: 0
##    Heading NA: 0
##    Speed variance: 0.365424790314973
##    Heading variance: 0

Bind the 2 new features, stop_count and turn_count, to the training data. The result preserves the original row order, so no join is needed.

Trip 0195 failed due to zero variance in the heading vector. Counts will be NA for this record.

5.2 Impute NA

We will impute the missing values in order to use all observations in modeling.

## turn_count + stop_count ~ feature1 + feature3 + feature4 + feature6 + 
##     feature7 + feature9 + feature12 + feature13 + feature14
## <environment: 0x153953e0>
## 
## Missing value imputation by random forests
## 
##   Variables to impute:       turn_count, stop_count
##   Variables used to impute:  feature1, feature3, feature4, feature6, feature7, feature9, feature12, feature13, feature14
##  trn_cn  stp_cn
## iter 1:  1.0021  1.0014  

5.3 Model Tuning

We will use the random forest algorithm as implemented in the ranger library. We prefer a simple and computationally fast approach - random forest provides a strong off-the-shelf classifier with few hyperparameters requiring tuning (as compared to say gradient boosting or neural networks) and is extremely parallelizable.

The two main hyperparameters we will tune with repeated cross-validation are: the number of features to randomly select for each potential split (mtry); and the minimum number of observations required in each node (min.node.size). The latter controls model complexity and overfitting - we will not tune or use max.depth which also has the effect of reducing complexity of individual learners.

We will also evaluate two possible split rules - the traditional Gini impurity, as well as the more recent Hellinger distance. Gini favors splits which produce pure (class-homogeneous) leaf nodes, a criterion that can be biased toward the majority class under imbalance. Hellinger distance, by contrast, measures the separation between the two class distributions and is insensitive to class skew; the rule generally results in better (higher) sensitivity, with minimal loss in specificity.2

Setting up our grid and controls for tuning
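A sketch of the setup, assuming caret drives the search over ranger’s tuning parameters; the grid values here are illustrative, not the exact grid used:

```r
library(caret)

## illustrative hyperparameter grid for method = "ranger"
tune_grid <- expand.grid(mtry          = c(2, 4, 6),
                         splitrule     = c("gini", "hellinger"),
                         min.node.size = c(5, 10, 15))

## repeated 5-fold CV, scoring by AUC (twoClassSummary reports ROC)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
```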

Run the learning procedure

View the estimated AUC across our hyperparameter grid

We will select the best result from the above set of evaluations - the Hellinger distance split rule with mtry = 2 and min.node.size = 15.

5.4 Final rf Model

Now train the model using the optimal hyperparameters. We will plot feature importance based on gini impurity. Note that our model returns class probabilities rather than class labels - we prefer this so we can choose the threshold for labeling later, which may not be 0.5 if we have unequal penalties from each type of misclassification (false negatives and false positives).
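A sketch of the final fit with the selected hyperparameters; train_model is a hypothetical name for the prepared training frame:

```r
library(ranger)

final_rf <- ranger(y_fctr ~ ., data = train_model,
                   num.trees = 500,
                   mtry = 2, min.node.size = 15, splitrule = "hellinger",
                   probability = TRUE,       ## return class probabilities
                   importance = "impurity",  ## gini-based feature importance
                   seed = 1)
```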

feature7 provides the highest predictive contribution, while feature1 and feature3 provide the least amount of information. We also note that our bespoke features extracted from the trip data contain some signal of our response, though they are not the strongest predictors. We will look closer at the nature of these contributions in the next section.

5.5 Model Inference

To provide fair estimates of predictive power we want to form predictions from our model on new (unseen) data. One nice feature of random forests is that they come with a built-in hold-out sample at each iteration in the form of the out-of-bag (OOB) samples. ranger has already stored out-of-bag predictions for us, which we will use to calculate a confusion matrix and the more general AUC.

5.5.1 Confusion Matrix

We assume equal cost of misclassification for each class by setting our probability threshold at 0.50 for label prediction. If we knew that false positives were more “expensive” for us than false negatives or vice versa, then this threshold should be adjusted accordingly to minimize the total cost incurred from misclassification in production.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Class_0 Class_1
##    Class_0     648     102
##    Class_1      42     208
##                                           
##                Accuracy : 0.856           
##                  95% CI : (0.8327, 0.8772)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6444          
##                                           
##  Mcnemar's Test P-Value : 8.803e-07       
##                                           
##             Sensitivity : 0.6710          
##             Specificity : 0.9391          
##          Pos Pred Value : 0.8320          
##          Neg Pred Value : 0.8640          
##              Prevalence : 0.3100          
##          Detection Rate : 0.2080          
##    Detection Prevalence : 0.2500          
##       Balanced Accuracy : 0.8050          
##                                           
##        'Positive' Class : Class_1         
## 

The specificity (= 1 - false positive rate) of our classifier is ~.94 - we do a good job of avoiding “false alarms” of mislabeling cases as positive. The sensitivity (= 1 - false negative rate, aka recall) of ~.67 tells us how good we are at detecting positive cases. Precision - the proportion of positive-labeled cases which are truly positive - of .83 tells us that most of our positive predictions are accurate. By setting the decision threshold at .50, we have sought to maximize overall accuracy, estimated as ~86%.

5.5.2 AUC

To get the confusion matrix above, we had to choose a threshold (.50) to move from a predicted probability of an observation being in Class 1 to actually labeling the observation as a 1 or 0. The ROC curve shows the tradeoff between sensitivity and specificity as we vary that threshold between 0.0 and 1.0 for labeling a case positive. In a way then it is like a continuous confusion matrix.
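A sketch with pROC, whose roc() call emits the level/direction messages seen in the output; oob_prob is a hypothetical vector of out-of-bag P(Class_1) from the ranger fit:

```r
library(pROC)

roc_obj <- roc(response = train$y_fctr, predictor = oob_prob)
auc(roc_obj)
plot(roc_obj, print.auc = TRUE)
```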

## Setting levels: control = Class_0, case = Class_1
## Setting direction: controls < cases

The curve shows the tradeoff between the two types of misclassifications, with the red circle indicating how we have calibrated with a .50 decision threshold. We can increase our sensitivity and correctly label more positive cases, but we will increase our mis-labeling of the negative cases at the same time. Due to the class imbalance (negative cases are about 2/3 of the data), a 1 point reduction in specificity must be offset by at least a 2 point improvement in sensitivity, roughly speaking and still assuming equal cost of errors.

5.5.3 Predictors

We want to understand the nature of the relationship between our features and the response. Since there is just a handful of features, we can look at individual conditional expectation (ICE) plots for each. Like partial dependence plots, these show the average marginal effect of a feature, but in addition they show the marginal effect for individual observations in the training set. This helps us understand whether our forest learned important interactions between variables.

We are selecting 4 of the 11 features to plot below

[ICE plots for feature7, feature4, turn_count, and stop_count appear here.]

We observe that turn_count interacts with another feature in our model - the marginal effect of moving above 3 turns appears to depend on the value of some other variable. If we dig further we can find out what feature this is and use the information to specify interactions say in a parametric model - work for a future analysis.

The marginal effect of stop_count appears to be weak and perhaps counter-intuitive in its direction. We should revisit the extraction logic for this feature, or perhaps consider other features which more completely describe the nuance of driving behavior - repeated stopping at low speed, hard stopping, rapid deceleration to a near stop, etc.

6 Test predictions

Creating the test dataset

##   filename   feature1   feature3   feature4   feature6   feature7   feature9 
##          0          0          0          0          0          0          0 
##  feature12  feature13  feature14 stop_count turn_count 
##          0          0          0          0          0

Labeling with the final model. As in the confusion matrix analysis above, we assume equal misclassification costs for each class by setting the decision bound at 0.5; if false positives were more “expensive” than false negatives or vice versa, this threshold should be adjusted accordingly to minimize the cost incurred from misclassification.

7 References


  1. Yeh et al. “Time series joins, motifs, discords, and shapelets: a unifying view that exploits the matrix profile.” 2017. https://www.cs.ucr.edu/~eamonn/MP_journal.pdf

  2. Lyon et al. “Hellinger Distance Trees for Imbalanced Streams.” 2014. https://arxiv.org/pdf/1405.2278.pdf